linguistic confidence
Can Large Language Models Express Uncertainty Like Human?
Tao, Linwei, Yeh, Yi-Fan, Kai, Bo, Dong, Minjing, Huang, Tao, Lamb, Tom A., Yu, Jialin, Torr, Philip H. S., Xu, Chang
Large language models (LLMs) are increasingly used in high-stakes settings, where overconfident responses can mislead users. Reliable confidence estimation has been shown to enhance trust and task accuracy. Yet existing methods face practical barriers: logits are often hidden, multi-sampling is computationally expensive, and verbalized numerical uncertainty (e.g., giving a 0-100 score) deviates from natural communication. We revisit linguistic confidence (LC), where models express uncertainty through hedging language (e.g., probably, might), offering a lightweight and human-centered alternative. To advance this direction, we 1) release the first diverse, large-scale dataset of hedging expressions with human-annotated confidence scores, and 2) propose a lightweight mapper that converts hedges into confidence scores at near-zero cost. Building on these resources, we 3) conduct the first systematic study of LC across modern LLMs and QA benchmarks, revealing that while most LLMs underperform in expressing reliable LC, carefully designed prompting achieves competitive calibration and discriminability. Finally, we 4) introduce a fine-tuning framework that further improves LC reliability. Taken together, our work positions linguistic confidence as a scalable, efficient, and human-aligned approach to LLM uncertainty estimation, and calls for deeper exploration of this promising yet underexplored direction. The code and dataset are anonymously available at https://anonymous.

Large language models (LLMs) are increasingly deployed in real-world applications, from education and healthcare to law and scientific discovery. While their capabilities make them powerful assistants, LLMs are also prone to hallucinations and factual errors, and human overreliance on their outputs can lead to serious consequences. For instance, a U.S. lawyer once submitted fabricated cases generated by ChatGPT, resulting in professional sanctions (ABC News, 2023). Recent social experiments demonstrate that people adjust their reliance on AI depending on how confident the model appears: reliable expressions of uncertainty can enhance trust, satisfaction, and task accuracy (Kim et al., 2024; Xu et al., 2025). These findings highlight the importance of associating reliable uncertainty estimates with LLM responses to support human decision-making. Ultimately, the conveyance of confidence plays a central role in shaping trust and guiding human-AI interaction. A growing body of work explores the extraction and representation of confidence in LLM outputs, most directly by reading token probabilities from the model's logits. These methods are simple and inexpensive but require access to model logits, which are typically unavailable in commercial LLM APIs. Another line of work prompts the model to verbalize a numerical confidence score alongside its answer. However, such scores rarely align with common user behavior or natural communication, as users do not typically phrase queries with explicit instructions like "Please output your confidence along with the answer."
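To make the hedge-to-confidence idea concrete, below is a minimal sketch of such a mapper. The lexicon, scores, and aggregation rule are illustrative assumptions, not the paper's released dataset or trained mapper.

```python
# Minimal sketch of a hedge-to-confidence mapper (illustrative lexicon and scores).
import re

HEDGE_SCORES = {
    "definitely": 0.95, "certainly": 0.95, "clearly": 0.90,
    "likely": 0.75, "probably": 0.70, "perhaps": 0.50,
    "possibly": 0.45, "might": 0.40, "unlikely": 0.20, "doubtful": 0.15,
}

def linguistic_confidence(response: str, default: float = 0.60) -> float:
    """Map hedging expressions in a model response to a scalar confidence score."""
    tokens = re.findall(r"[a-z']+", response.lower())
    hits = [HEDGE_SCORES[t] for t in tokens if t in HEDGE_SCORES]
    # If several hedges appear, keep the most cautious (lowest) score;
    # with no hedge at all, fall back to a neutral default.
    return min(hits) if hits else default

print(linguistic_confidence("It is probably Paris, but it might be Lyon."))  # 0.4
```

The paper's mapper is presumably built from its human-annotated hedge dataset rather than hand-set as above; the sketch only shows the interface: free-form hedged text in, a scalar confidence out.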
Llamas Know What GPTs Don't Show: Surrogate Models for Confidence Estimation
Shrivastava, Vaishnavi, Liang, Percy, Kumar, Ananya
To maintain user trust, large language models (LLMs) should signal low confidence on examples where they are incorrect, instead of misleading the user. The standard approach of estimating confidence is to use the softmax probabilities of these models, but as of November 2023, state-of-the-art LLMs such as GPT-4 and Claude-v1.3 do not give access to these probabilities. We first study eliciting confidence linguistically -- asking an LLM for its confidence in its answer -- which performs reasonably (80.5% AUC on GPT-4 averaged across 12 question-answering datasets -- 7% above a random baseline) but leaves room for improvement. We then explore using a surrogate confidence model -- using a model where we do have probabilities to evaluate the original model's confidence in a given question. Surprisingly, even though these probabilities come from a different and often weaker model, this method leads to higher AUC than linguistic confidences on 9 out of 12 datasets. Our best method, composing linguistic confidences and surrogate model probabilities, gives state-of-the-art confidence estimates on all 12 datasets (84.6% average AUC on GPT-4).

As large language models (LLMs) are increasingly deployed, it is important that they signal low confidence on examples where they are likely to make mistakes. This paper's goal is to produce good confidence estimates for state-of-the-art LLMs, which do not provide model probabilities or representations (such as GPT-4 and Claude-v1.3). We first examine a natural idea of eliciting linguistic confidence scores (Tian et al., 2023; Lin et al., 2022; Xiong et al., 2023) -- prompting the LLM to assess its confidence in its answer (Figure 1, GPT-4 Linguistic). We find that linguistic confidences work reasonably well for state-of-the-art models, and much better than a random guessing baseline, but still leave room for improvement (Section 3). Averaged across the datasets, GPT-4 achieves a selective classification AUC of 80.5%, which is 7% above a random guessing baseline. Our results hold across 12 standard datasets (8 MMLU datasets, TruthfulQA, CommonsenseQA, OpenbookQA, and MedQA), 5 models (GPT-4, Claude-v1.3,
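As a concrete illustration of how such confidence estimates are scored and composed, here is a small sketch. The toy data, the accuracy-coverage discretization, and the simple averaging rule are assumptions for illustration, not the paper's exact metric or composition method.

```python
# Sketch: area under the accuracy-coverage curve for a confidence estimator,
# plus a simple composition of linguistic and surrogate-model confidences.
import numpy as np

correct = np.array([1, 0, 1, 1, 0, 1, 1, 0])                            # answer correctness
linguistic = np.array([0.9, 0.8, 0.7, 0.9, 0.6, 0.8, 0.9, 0.7])         # verbalized confidences
surrogate = np.array([0.85, 0.40, 0.75, 0.95, 0.35, 0.70, 0.90, 0.30])  # probs from an open surrogate model

def selective_auc(conf: np.ndarray, correct: np.ndarray) -> float:
    """One common discretization of selective-classification AUC:
    rank answers by confidence, then average accuracy over all coverage levels."""
    order = np.argsort(-conf)
    ranked = correct[order]
    acc_at_coverage = np.cumsum(ranked) / np.arange(1, len(ranked) + 1)
    return float(acc_at_coverage.mean())

composed = 0.5 * linguistic + 0.5 * surrogate  # assumed composition rule (simple average)
for name, conf in [("linguistic", linguistic), ("surrogate", surrogate), ("composed", composed)]:
    print(f"{name:>10s} AUC: {selective_auc(conf, correct):.3f}")
```

A higher value means the confidence scores do a better job of ranking correct answers above incorrect ones, which is what selective classification rewards.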
Linguistic calibration through metacognition: aligning dialogue agent responses with expected correctness
Mielke, Sabrina J., Szlam, Arthur, Boureau, Y-Lan, Dinan, Emily
Open-domain dialogue agents have vastly improved, but still confidently hallucinate knowledge or express doubt when asked straightforward questions. In this work, we analyze whether state-of-the-art chit-chat models can express metacognition capabilities through their responses: does a verbalized expression of doubt (or confidence) match the likelihood that the model's answer is incorrect (or correct)? We find that these models are poorly calibrated in this sense, yet we show that the representations within the models can be used to accurately predict likelihood of correctness. By incorporating these correctness predictions into the training of a controllable generation model, we obtain a dialogue agent with greatly improved linguistic calibration.
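The metacognition idea above amounts to: predict how likely the answer is to be correct, then phrase the response with a matching degree of doubt. The sketch below illustrates only that final verbalization step; the thresholds and hedge phrases are illustrative assumptions, not the paper's controllable-generation training setup.

```python
# Sketch: turning a predicted probability of correctness into a hedged response prefix.
# Thresholds and phrases are illustrative assumptions.
def hedge_prefix(p_correct: float) -> str:
    """Choose a verbal hedge whose strength matches the predicted correctness."""
    if p_correct >= 0.9:
        return "I'm confident that"
    if p_correct >= 0.6:
        return "I believe"
    if p_correct >= 0.3:
        return "I'm not sure, but possibly"
    return "I don't really know, but my guess is"

print(hedge_prefix(0.92), "the capital of France is Paris.")
print(hedge_prefix(0.25), "the capital of Australia is Sydney.")
```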